Successive Convex Approximation Based Off-Policy Optimization for Constrained Reinforcement Learning
Authors
Abstract
Constrained reinforcement learning (CRL), also termed safe learning, is a promising technique enabling the deployment of RL agents in real-world systems. In this paper, we propose a successive convex approximation based off-policy optimization (SCAOPO) algorithm to solve the general CRL problem, which is formulated as a constrained Markov decision process (CMDP) in the context of average cost. The SCAOPO is based on solving a sequence of convex objective/feasibility optimization problems obtained by replacing the objective and constraint functions in the original problem with surrogate functions. The proposed algorithm enables the reuse of experiences from previous updates, thereby significantly reducing the implementation cost when deployed in engineering systems that need to learn the environment online. In spite of the time-varying state distribution and the stochastic bias incurred by off-policy learning, the SCAOPO with a feasible initial point can still provably converge to a Karush-Kuhn-Tucker (KKT) point of the original problem almost surely.
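As an illustration of the surrogate objective/feasibility step described above, here is a minimal sketch, not the authors' implementation: the objective and constraint are replaced by convex surrogates built from gradient estimates, and the surrogate problem is handed to a generic solver. The names `grad_obj`, `grad_con`, and `con_val` stand in for quantities that would be estimated from (re-weighted) off-policy samples, and the proximal weight `tau` and smoothing factor `gamma` are illustrative choices.

```python
import numpy as np
from scipy.optimize import minimize

def sca_step(theta, grad_obj, grad_con, con_val, tau=1.0, gamma=0.5):
    """One SCA-style update around the current policy parameter theta."""
    # Convex quadratic surrogate of the objective: linearization + proximal term.
    def surr_obj(x):
        return grad_obj @ (x - theta) + 0.5 * tau * np.sum((x - theta) ** 2)

    # Linearized surrogate of the constraint J1(x) <= 0 (SLSQP expects fun(x) >= 0).
    def surr_con(x):
        return -(con_val + grad_con @ (x - theta))

    res = minimize(surr_obj, theta, method="SLSQP",
                   constraints=[{"type": "ineq", "fun": surr_con}])
    theta_bar = res.x if res.success else theta  # a feasibility update would be used here instead

    # Smoothed update typical of stochastic SCA schemes.
    return (1 - gamma) * theta + gamma * theta_bar
```

Because the gradient estimates can be formed from stored experiences re-weighted toward the current policy, old samples remain usable across updates, which is the source of the implementation-cost savings claimed above.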
Similar Resources
Stochastic Successive Convex Approximation for Non-Convex Constrained Stochastic Optimization
This paper proposes a constrained stochastic successive convex approximation (CSSCA) algorithm to find a stationary point for a general non-convex stochastic optimization problem, whose objective and constraint functions are nonconvex and involve expectations over random states. The existing methods for non-convex stochastic optimization, such as the stochastic (average) gradient and stochastic...
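The snippet is truncated, but the core mechanism of such stochastic SCA schemes is to fold noisy samples of the objective and constraint functions into running averages with a diminishing weight before building the convex surrogates. A hedged sketch of that averaging step follows; the weight schedule and names are assumptions, and the cited paper's exact surrogate construction may differ.

```python
import numpy as np

def update_surrogate_stats(avg_val, avg_grad, new_val, new_grad, t):
    """Fold one new stochastic sample into the running estimates used by the surrogate."""
    rho = 1.0 / (t + 1) ** 0.6                        # diminishing averaging weight (illustrative)
    avg_val = (1 - rho) * avg_val + rho * new_val
    avg_grad = (1 - rho) * avg_grad + rho * np.asarray(new_grad)
    return avg_val, avg_grad
```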
Parallel Successive Convex Approximation for Nonsmooth Nonconvex Optimization
Consider the problem of minimizing the sum of a smooth (possibly non-convex) and a convex (possibly nonsmooth) function involving a large number of variables. A popular approach to solve this problem is the block coordinate descent (BCD) method whereby at each iteration only one variable block is updated while the remaining variables are held fixed. With the recent advances in the developments ...
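For a concrete picture of the BCD scheme described in the snippet, the sketch below updates one variable block per iteration while the others stay fixed. This is a plain cyclic variant for illustration, not the parallel SCA method of the cited paper, and the toy problem and step size are assumptions.

```python
import numpy as np

def bcd(grad, x0, blocks, step=0.01, iters=500):
    """Cyclic block coordinate descent: `blocks` is a list of index arrays."""
    x = x0.copy()
    for t in range(iters):
        b = blocks[t % len(blocks)]   # pick one block in cyclic order
        x[b] -= step * grad(x)[b]     # gradient step on that block only
    return x

# Toy usage: least squares split into two variable blocks.
rng = np.random.default_rng(0)
A, y = rng.normal(size=(20, 6)), rng.normal(size=20)
x_hat = bcd(lambda x: 2 * A.T @ (A @ x - y), np.zeros(6),
            [np.arange(3), np.arange(3, 6)])
```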
On-Policy vs. Off-Policy Updates for Deep Reinforcement Learning
Temporal-difference-based deep-reinforcement learning methods have typically been driven by off-policy, bootstrap Q-Learning updates. In this paper, we investigate the effects of using on-policy, Monte Carlo updates. Our empirical results show that for the DDPG algorithm in a continuous action space, mixing on-policy and off-policy update targets exhibits superior performance and stability comp...
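A hedged sketch of the target-mixing idea discussed above: the critic target becomes a convex combination of an on-policy Monte Carlo return and an off-policy one-step bootstrap target. The mixing weight `beta` and the toy numbers are assumptions; the cited paper's experiments may combine the targets differently.

```python
def mixed_target(rewards, next_q, gamma=0.99, beta=0.5):
    """rewards: r_t, ..., r_{T-1} from a rollout; next_q: bootstrapped Q(s_{t+1}, a_{t+1})."""
    mc_return = sum(gamma ** k * r for k, r in enumerate(rewards))  # on-policy Monte Carlo target
    td_target = rewards[0] + gamma * next_q                         # off-policy one-step target
    return beta * mc_return + (1 - beta) * td_target

print(mixed_target(rewards=[1.0, 0.5, 0.2], next_q=3.0))
```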
Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning
In this paper we present a new way of predicting the performance of a reinforcement learning policy given historical data that may have been generated by a different policy. The ability to evaluate a policy from historical data is important for applications where the deployment of a bad policy can be dangerous or costly. We show empirically that our algorithm produces estimates that often have ...
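The snippet above is truncated, so as a point of reference only, the block below sketches the standard trajectory-wise importance sampling estimator for evaluating a target policy from historical data generated by a different behaviour policy. This is a common baseline in off-policy evaluation, not necessarily the estimator proposed in the cited paper, and the logged action probabilities are assumed to be available.

```python
def importance_sampling_estimate(trajectories, gamma=0.99):
    """trajectories: list of episodes, each a list of (pi_prob, b_prob, reward) tuples."""
    estimates = []
    for traj in trajectories:
        rho, ret = 1.0, 0.0
        for t, (pi_prob, b_prob, reward) in enumerate(traj):
            rho *= pi_prob / b_prob          # likelihood ratio of target vs. behaviour policy
            ret += gamma ** t * reward
        estimates.append(rho * ret)          # re-weighted return for this episode
    return sum(estimates) / len(estimates)
```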
Off-Policy Shaping Ensembles in Reinforcement Learning
Recent advances of gradient temporal-difference methods allow to learn off-policy multiple value functions in parallel without sacrificing convergence guarantees or computational efficiency. This opens up new possibilities for sound ensemble techniques in reinforcement learning. In this work we propose learning an ensemble of policies related through potential-based shaping rewards. The ensembl...
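The mechanism tying the ensemble members together is potential-based reward shaping, where each member learns from a reward shaped by its own potential function; shaping of this form is known to leave the optimal policy unchanged. Below is a minimal sketch with made-up potentials; the cited paper's potentials and ensemble combination rule may differ.

```python
def shaped_reward(reward, s, s_next, phi, gamma=0.99):
    """Potential-based shaping: r' = r + gamma * phi(s') - phi(s)."""
    return reward + gamma * phi(s_next) - phi(s)

potentials = [lambda s: 0.0, lambda s: -abs(s - 10)]   # a toy ensemble of two potentials
rewards = [shaped_reward(1.0, s=3, s_next=4, phi=p) for p in potentials]
```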
Journal
Journal title: IEEE Transactions on Signal Processing
Year: 2022
ISSN: 1053-587X, 1941-0476
DOI: https://doi.org/10.1109/tsp.2022.3158737